ABSTRACT 

The generalized graded unfolding model (GGUM) (J. Roberts, J. Donoghue, 
and J. Laughlin, 1998) is an item response theory model designed 
to analyze binary or graded responses that are based on a proximity relation. 
The purpose of this study was to assess conditions under which item parameter 
estimation accuracy increases or decreases, with special attention paid to 
the influence that a given item parameter value has on the estimation of 
another item parameter. This assessment was based on a recovery simulation in 
which the effects of sample size, item location, degree of item 
discrimination, and extremity of subjective category thresholds were varied. 
Results indicate that with 750 or more respondents, sample size has 
negligible effect on all but the estimation of subjective response category 
thresholds. The true extremity of both item location and item discrimination 
did affect the estimation of these parameters themselves, and also affected 
the estimation of other item parameters in the model. However, these effects 
were modest and had little impact on the estimation of the corresponding item 
response functions. These results suggest that marginal maximum likelihood 
estimates of item parameters will provide accurate results across a variety 
of item parameter configurations when the sample size is at or above the 
recommended levels. (Contains 1 table, 3 figures, and 14 references.) 
Abstract 



The generalized graded unfolding model (GGUM; Roberts, Donoghue & Laughlin, 1998, 
1999) is an item response theory model designed to analyze binary or graded responses that are 
based on a proximity relation (Coombs, 1964). A typical application of the GGUM is in 
measurement situations where respondents are asked to indicate their level of agreement with a 
series of statements that span a bipolar attitude continuum (e.g., Thurstone or Likert attitude 
measurement scales). 

Roberts, Donoghue, and Laughlin (1998) have shown that when the data conform to the 
GGUM, then accurate item parameter estimates can be obtained with a marginal maximum 
likelihood procedure when the sample size is approximately 750 or more. Similarly, Roberts et al. 
have demonstrated that accurate expected a posteriori estimates of person parameters can be 
obtained with approximately 20 items with 6 response categories per item. Although the 
minimum data demands associated with these estimation procedures have been investigated, other 
questions about the robustness of parameter estimation accuracy remain unanswered. 

The purpose of this study was to assess conditions under which item parameter estimation 
accuracy increases or decreases, with special attention paid to the influence that a given item 
parameter value has on the estimation of another item parameter. This assessment was based on a 
recovery simulation in which the effects of sample size, item location, degree of item 
discrimination and extremity of subjective category thresholds were varied. The results indicated 
that with 750 or more respondents, sample size had negligible effects on all but the estimation of 
subjective response category thresholds. Additionally, the true extremity of both item location 
and item discrimination did affect the estimation of these parameters themselves, and also affected 
the estimation of other item parameters in the model. However, these effects were modest and 
had little impact on the estimation of the corresponding item response function. These results 
suggest that marginal maximum likelihood estimates of item parameters will provide accurate 
results across a variety of item parameter configurations when the sample size is at or above the 
recommended levels. 
Educational researchers typically use self-report questionnaires to assess attitudes toward or 
preferences for a variety of stimuli (e.g., attitude toward mathematics, preference for alternative 
types of instruction, etc.). Such questionnaires often contain a graded disagree-agree response 
format to gauge the level of individual agreement to a series of statements that range in content 
from negative, to neutral, to positive opinions. Several researchers (Andrich, 1996; Roberts, 
1995; Roberts & Laughlin, 1996a, 1996b; Roberts, Laughlin & Wedell, 1999; van Schuur & 
Kiers, 1994) have argued that graded disagree-agree responses are generally more consistent with 
an unfolding model of the response process than with the more popular cumulative model. 
Unfolding models are proximity models which imply that higher item scores, indicative of stronger 
levels of agreement, are more probable as the distance between an individual and an 
item on the underlying latent continuum decreases (Coombs, 1964). 

Roberts and colleagues (Roberts, 1995; Roberts, Donoghue & Laughlin, 1998, 1999; Roberts 
& Laughlin, 1996a, 1996b) have developed a family of item response theory models that 
implement an unfolding response mechanism. The most general of these models is called the 
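Generalized Graded Unfolding Model (GGUM). The GGUM defines the probability that the jth 
respondent will choose the zth response category to the ith item as: 

$$
P(Z_i = z \mid \theta_j) =
\frac{\exp\left\{\alpha_i\left[z(\theta_j - \delta_i) - \sum_{k=0}^{z}\tau_{ik}\right]\right\}
      + \exp\left\{\alpha_i\left[(M - z)(\theta_j - \delta_i) - \sum_{k=0}^{z}\tau_{ik}\right]\right\}}
     {\sum_{w=0}^{C}\left[\exp\left\{\alpha_i\left[w(\theta_j - \delta_i) - \sum_{k=0}^{w}\tau_{ik}\right]\right\}
      + \exp\left\{\alpha_i\left[(M - w)(\theta_j - \delta_i) - \sum_{k=0}^{w}\tau_{ik}\right]\right\}\right]}
\qquad (1)
$$

for z = 0, 1, ..., C; where $\theta_j$ is the location of the jth individual on the latent continuum; 
$\delta_i$ is the location of the ith item on the latent continuum; $\alpha_i$ is the discrimination 
parameter for the ith item; $\tau_{ik}$ is the kth subjective response category threshold for the ith 
item, with $\tau_{i0}$ defined to be 0; C is the number of observable response categories minus 1; and 
M is equal to 2C + 1. Note that in the context of attitude measurement, $\theta_j$ is an index of the 
jth individual's attitude, and $\delta_i$ is an indicator of the ith item's affective content. 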



The GGUM leads to single-peaked, bell-shaped response functions that imply higher levels of 
agreement to the extent that the individual and the item are close to each other on the latent 
continuum. The GGUM is more general than other unfolding item response theory models in that 
it allows items to vary in their discrimination capabilities via $\alpha_i$, and it allows subjects to 
utilize the response scale differently for each item via $\tau_{ik}$. Figure 1 illustrates how the 
$\alpha_i$ and $\tau_{ik}$ parameters affect the item response function under the GGUM. In the two 
upper panels of Figure 1, the value of $\alpha_i$ is relatively low (i.e., .59) whereas it is relatively 
high (i.e., 1.63) in the two lower panels. Similarly, the distance between successive $\tau_{ik}$ values 
is lower (i.e., .25) for the two panels on the left-hand side of Figure 1, whereas this distance is 
higher (i.e., .57) for the two panels shown on the right-hand side of the figure. As shown in 
Figure 1, the item response function under the GGUM has a larger maximum value and becomes 
more peaked when the value of the item discrimination parameter increases. When the distance 
between successive $\tau_{ik}$ values increases, the item response function also achieves a larger 
maximum value but simultaneously becomes more diffuse. Thus, the values of $\alpha_i$ and 
$\tau_{ik}$ have distinctively different effects on the shape of the item response function under the 
GGUM. 

[Figure 1: four panels plotting the expected value of the response against $\theta_j - \delta_i$. 
Upper left: $\alpha_i$ = .59, $\tau_{ik}$ = -1.25, -1.0, -.75, -.5, -.25. 
Upper right: $\alpha_i$ = .59, $\tau_{ik}$ = -2.85, -2.28, -1.71, -1.14, -.57. 
Lower left: $\alpha_i$ = 1.63, $\tau_{ik}$ = -1.25, -1.0, -.75, -.5, -.25. 
Lower right: $\alpha_i$ = 1.63, $\tau_{ik}$ = -2.85, -2.28, -1.71, -1.14, -.57.] 

Figure 1. Item response functions for four hypothetical items under the GGUM. Items vary with 
regard to $\alpha_i$ (.59 or 1.63) and the distance between successive $\tau_{ik}$ values (.25 or .57). 
Each hypothetical item has 6 possible response categories that range from 0 to 5. 
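
The behavior described for Figure 1 can be checked numerically. The following Python sketch 
evaluates the GGUM category probabilities and the expected response for the four hypothetical 
items. It assumes the usual GGUM parameterization with the zeroth threshold fixed at 0 and 
M = 2C + 1; the function names (ggum_probs, expected_response, thresholds) and the use of NumPy 
are illustrative choices, not part of the original study. 

```python
import numpy as np

def ggum_probs(theta, delta, alpha, tau):
    """P(Z = z | theta) for z = 0..C under the GGUM (tau holds tau_1..tau_C)."""
    tau = np.asarray(tau, dtype=float)
    C = tau.size
    M = 2 * C + 1
    z = np.arange(C + 1)
    cum_tau = np.concatenate(([0.0], np.cumsum(tau)))  # sum_{k=0}^{z} tau_k, tau_0 = 0
    d = theta - delta
    # Each numerator is a sum of two exponentials; work on the log scale for stability.
    log_num = np.logaddexp(alpha * (z * d - cum_tau),
                           alpha * ((M - z) * d - cum_tau))
    num = np.exp(log_num - log_num.max())
    return num / num.sum()

def expected_response(theta, delta, alpha, tau):
    """Expected value of the observed response (the item response function)."""
    p = ggum_probs(theta, delta, alpha, tau)
    return float(np.dot(np.arange(p.size), p))

def thresholds(distance, C=5):
    """Equally spaced thresholds, e.g. distance=.25 gives -1.25, -1.0, ..., -.25."""
    return -distance * np.arange(C, 0, -1)

# The four hypothetical items of Figure 1, evaluated at theta = delta (the peak).
for alpha in (0.59, 1.63):
    for distance in (0.25, 0.57):
        peak = expected_response(0.0, 0.0, alpha, thresholds(distance))
        print(f"alpha={alpha:.2f} distance={distance:.2f} peak={peak:.2f}")
```

Because the model is symmetric about the item location, evaluating the function at 
$\theta_j - \delta_i$ = +1 and -1 gives the same expected response. 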

Roberts et al. (1998) have shown that when data conform to the GGUM, item parameters can 
be accurately estimated using a marginal maximum likelihood (MML) technique (Bock & Aitkin, 
1981; Muraki, 1992) with samples of 750 or more respondents. Additionally, person locations 
(i.e., attitudes or ideal points) can be accurately estimated with an expected a posteriori technique 
(Bock & Mislevy, 1982) when there are approximately 20 items with 6 graded disagree-agree 
response categories per item. Although these minimum data demands required to estimate 
parameters of the GGUM have been demonstrated, further questions about parameter estimability 
remain. Specifically, there is currently no information on the degree to which estimates of one 
parameter are affected by the values of other parameters in the model. One could speculate that 
the estimability of a given type of item parameter might degrade when values of the other 
parameters in the model get too large or too small. For example, it is important to know whether 
estimates of two identical item locations will be stable if the discrimination parameters associated 
with the two items vary widely. Similar questions can be asked with regard to estimates of each 
item parameter. 

The primary aim of this study was to determine the accuracy of item parameter estimates as 
the values of remaining item parameters were varied in a systematic fashion. The utility of the 
GGUM model rests on the ability to accurately estimate model parameters, and thus, this issue is 
important to the advancement of this class of models in educational research. 

Method 

A parameter recovery simulation was used to study the effects of the following variables on 
the recovery of item parameters: 

1) sample size (750 or 2000) 

2) item location (-2, -1, 0, +1, +2) 

3) discrimination (low=.59; high=1.63) 

4) distance between successive subjective response category thresholds (a.k.a. 
interthreshold distance; low=.25; high=.57) 
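
The factor levels listed above can be crossed programmatically. The sketch below is one way 
to lay out the design in Python; it assumes 6 response categories (5 thresholds) per item and 
equally spaced thresholds of the kind shown in Figure 1, neither of which is stated explicitly 
for the simulated items. All names are illustrative. 

```python
from itertools import product

SAMPLE_SIZES = (750, 2000)
LOCATIONS = (-2.0, -1.0, 0.0, 1.0, 2.0)
DISCRIMINATIONS = (0.59, 1.63)
THRESHOLD_DISTANCES = (0.25, 0.57)
N_THRESHOLDS = 5  # assumption: 6 response categories, as in Figure 1

def make_thresholds(distance, n=N_THRESHOLDS):
    """Equally spaced negative thresholds; the last one sits one distance below 0."""
    return [round(-distance * (n - k), 2) for k in range(n)]

# Crossing location x discrimination x interthreshold distance gives the item set.
items = [
    {"delta": d, "alpha": a, "tau": make_thresholds(t)}
    for d, a, t in product(LOCATIONS, DISCRIMINATIONS, THRESHOLD_DISTANCES)
]

# The same items are studied under each sample size condition.
cells = list(product(SAMPLE_SIZES, range(len(items))))

print(len(items), "items;", len(cells), "design cells")
```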

The values used for high and low discrimination and interthreshold distance corresponded to 
those observed in real data, whereas the range of item locations was chosen to be similar to past 
simulation research on item parameter recovery. The levels of each item parameter variable were 
crossed to produce a set of 20 items, and these same 20 items were studied under the 2 sample 
size conditions. Responses to the 20 items were generated according to the GGUM, and item 
parameters were subsequently estimated using the MML technique. The true $\theta_j$ parameters used 
in the simulation were drawn from a N(0,1) distribution and were integrated out of the likelihood 
equation using a standard normal prior distribution (Bock & Aitkin, 1981; Roberts et al., 1998). 
Integration was accomplished with numerical quadrature using 30 equally spaced quadrature 
points ranging from -4 to +4. Convergence of parameter estimates was operationally defined as a 
change of less than .001 in any item parameter from one iteration of the MML algorithm to the 
next. This process of generating data and estimating parameters was independently replicated 30 
times in each sample size condition using the same true values of the person parameters. 
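
The data generation and quadrature steps can be sketched as follows. This is a minimal 
illustration rather than the estimation program used in the study: it simulates responses to a 
single item and computes marginal category probabilities on a 30-point grid from -4 to +4. The 
quadrature weights are taken to be normalized standard normal densities at the nodes, which is 
an assumption about the implementation; the seed and function names are likewise illustrative. 

```python
import numpy as np

def ggum_probs(theta, delta, alpha, tau):
    """GGUM category probabilities; one row per theta value, one column per category."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    tau = np.asarray(tau, dtype=float)
    C = tau.size
    M = 2 * C + 1
    z = np.arange(C + 1)
    cum_tau = np.concatenate(([0.0], np.cumsum(tau)))
    d = (theta - delta)[:, None]
    log_num = np.logaddexp(alpha * (z * d - cum_tau),
                           alpha * ((M - z) * d - cum_tau))
    num = np.exp(log_num - log_num.max(axis=1, keepdims=True))
    return num / num.sum(axis=1, keepdims=True)

def simulate_responses(theta, delta, alpha, tau, rng):
    """Draw one graded response per person by inverting the category CDF."""
    probs = ggum_probs(theta, delta, alpha, tau)
    cdf = np.cumsum(probs, axis=1)
    u = rng.random(probs.shape[0])[:, None]
    # Count the CDF values below u; the clip guards against rounding in the last column.
    return np.minimum((u > cdf).sum(axis=1), probs.shape[1] - 1)

def quadrature(n_points=30, lo=-4.0, hi=4.0):
    """Equally spaced nodes with normalized standard normal weights (assumed scheme)."""
    nodes = np.linspace(lo, hi, n_points)
    weights = np.exp(-0.5 * nodes ** 2)
    return nodes, weights / weights.sum()

rng = np.random.default_rng(0)
theta = rng.standard_normal(750)             # true person locations, N(0, 1)
tau = [-1.25, -1.0, -0.75, -0.5, -0.25]
responses = simulate_responses(theta, 0.0, 1.63, tau, rng)

nodes, weights = quadrature()
marginal = weights @ ggum_probs(nodes, 0.0, 1.63, tau)  # theta integrated out
observed = np.bincount(responses, minlength=6) / responses.size
print(np.round(marginal, 3))
print(np.round(observed, 3))
```

In an MML routine the marginal probabilities of whole response patterns, not single items, 
enter the likelihood; the single-item version is shown only to keep the sketch short. 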

The recovery simulation formed a 2 (sample size) x 5 (item location) x 2 (item discrimination) 
x 2 (subjective response category thresholds) factorial design. Main and interaction effects in this 
design were analyzed using an analysis of variance (ANOVA) model. Four dependent measures 
were analyzed with the ANOVA model. The first three measures were the root mean squared 
error (RMSE) between the true and estimated values of the three types of item parameters in the 
GGUM (i.e., the RMSE for estimates of $\delta_i$, $\alpha_i$, and $\tau_{ik}$). The fourth dependent 
variable was the average absolute deviation (AAD) between estimated and actual response 
functions over the interval of $\theta_j$ = (-3.0, +3.0). 

The RMSE for the subjective response category threshold parameters associated with the ith 
item was calculated as: 

$$
\mathrm{RMSE}_{\tau_i} = \sqrt{\frac{\sum_{k=1}^{C} (\hat{\tau}_{ik} - \tau_{ik})^2}{C}} , \qquad (2)
$$

where $\hat{\tau}_{ik}$ and $\tau_{ik}$ denote the estimated and true values for the kth subjective 
response category threshold, respectively. The RMSE for the ith item location was calculated as: 

$$
\mathrm{RMSE}_{\delta_i} = \sqrt{(\hat{\delta}_i - \delta_i)^2} , \qquad (3)
$$

and that for the ith discrimination was computed as: 

$$
\mathrm{RMSE}_{\alpha_i} = \sqrt{(\hat{\alpha}_i - \alpha_i)^2} . \qquad (4)
$$

Therefore, the RMSE for the $\delta_i$ and $\alpha_i$ parameters was simply equal to the absolute 
deviation between estimated and true parameter values. Note that there was an RMSE value 
corresponding to each of the three types of parameters associated with a given item in a 
particular replication. 
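
As a small illustration of these accuracy measures, the Python sketch below computes one RMSE 
value per parameter type for a single item in a single replication. It assumes that the 
threshold RMSE averages the squared errors over the C thresholds of the item; the estimates 
shown are hypothetical values chosen for the example. 

```python
import math

def rmse_thresholds(tau_hat, tau_true):
    """Root mean squared error over the C thresholds of one item."""
    C = len(tau_true)
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(tau_hat, tau_true)) / C)

def rmse_scalar(estimate, true_value):
    """For a single parameter the RMSE reduces to the absolute deviation."""
    return math.sqrt((estimate - true_value) ** 2)

# Hypothetical estimates for one item in one replication.
tau_true = [-1.25, -1.0, -0.75, -0.5, -0.25]
tau_hat = [-1.35, -1.1, -0.85, -0.6, -0.35]   # every threshold off by .10

print(rmse_thresholds(tau_hat, tau_true))     # threshold RMSE for the item
print(rmse_scalar(0.93, 1.0))                 # item location
print(rmse_scalar(1.71, 1.63))                # item discrimination
```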

The fourth dependent variable, the AAD, was analyzed in an analogous ANOVA model. 

Some researchers (Hulin, Lissak & Drasgow, 1982; Linn, Levine, Hastings & Wardrop, 1981) 
have suggested that this measure is more relevant than those which examine the accuracy of a 
single type of item parameter. This view stems from the fact that inaccuracies in estimates may 
cancel out across alternative item parameters, and thus, the estimated item response function 
might still be quite precise in cases where the accuracy of specific item parameters is questionable. 
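
The AAD index can be sketched in Python as the mean absolute difference between the estimated 
and true expected-response curves. The evaluation grid (61 equally spaced points between -3 and 
+3) is an assumption, since the spacing used in the study is not reported, and the estimated 
parameter values below are hypothetical. 

```python
import numpy as np

def expected_response(theta, delta, alpha, tau):
    """Expected GGUM response at each theta value (vectorized over theta)."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    tau = np.asarray(tau, dtype=float)
    C = tau.size
    M = 2 * C + 1
    z = np.arange(C + 1)
    cum_tau = np.concatenate(([0.0], np.cumsum(tau)))
    d = (theta - delta)[:, None]
    log_num = np.logaddexp(alpha * (z * d - cum_tau),
                           alpha * ((M - z) * d - cum_tau))
    probs = np.exp(log_num - log_num.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs @ z

def aad(est, true, lo=-3.0, hi=3.0, n_points=61):
    """Average absolute deviation between two response functions on a theta grid."""
    grid = np.linspace(lo, hi, n_points)
    diff = expected_response(grid, *est) - expected_response(grid, *true)
    return float(np.mean(np.abs(diff)))

# Each item is a (delta, alpha, tau) tuple; the estimates are hypothetical.
true_item = (0.0, 1.63, [-1.25, -1.0, -0.75, -0.5, -0.25])
est_item = (0.05, 1.55, [-1.30, -1.02, -0.70, -0.52, -0.20])

print(round(aad(est_item, true_item), 3))
```

Because the index compares whole curves, errors in individual parameters that offset one 
another contribute little to it, which is the rationale given above for treating it as the 
more relevant measure. 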



Due to the large number of observations used in these ANOVAs (N=1200), there was a 
substantial degree of power to detect both main effects and interactions in the 4 dependent 
measures. Therefore, only those effects that were statistically significant (p < .0125) and had a 
reasonable effect size ($\eta^2$ > .05) were interpreted. The type I error rate was set using a 
Bonferroni correction to control for the fact that there were 4 dependent measures examined 
using the same univariate ANOVA model ($\alpha$ = .05/4 = .0125). The $\eta^2$ index was defined as the 
sum of squares for a given effect divided by the total sum of squares for the dependent variable. 
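
The interpretability rule can be written out as a short Python sketch. The observation count 
appears to follow from 20 items, 2 sample sizes, and 30 replications; the sums of squares and 
p value used in the example are hypothetical, and the function names are illustrative. 

```python
def bonferroni_alpha(familywise_alpha, n_tests):
    """Per-test type I error rate under a Bonferroni correction."""
    return familywise_alpha / n_tests

def eta_squared(ss_effect, ss_total):
    """Proportion of total variation attributable to one effect."""
    return ss_effect / ss_total

def is_interpretable(p_value, eta2, alpha, min_eta2=0.05):
    """An effect is interpreted only if it is significant and large enough."""
    return p_value < alpha and eta2 > min_eta2

n_observations = 20 * 2 * 30          # items x sample sizes x replications
alpha = bonferroni_alpha(0.05, 4)     # four dependent measures

# Hypothetical sums of squares for a single effect.
eta2 = eta_squared(ss_effect=1.2, ss_total=10.0)

print(n_observations, alpha, eta2, is_interpretable(0.001, eta2, alpha))
```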



Results 

Overall Prediction of Individual Parameters and AAD 

The average RMSE for the $\delta_i$, $\alpha_i$, and $\tau_{ik}$ parameters was equal to .07, .06, and .16, 
respectively. This represented 4.8%, 11.2%, and 20.3% of the standard deviations of the 
corresponding true item parameters. Thus, the $\tau_{ik}$ parameters were hardest to accurately 
estimate, although all item parameters were estimated to a reasonably accurate degree. The 
average AAD value was equal to .064, and thus, the average degree of inaccuracy associated with 
a given item response function was quite small relative to the range of the response scale (e.g., 
from 0 to 5). 
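
The percentages above can be reproduced from the design with a few lines of Python. The sketch 
assumes that the standard deviations were sample standard deviations (n - 1 in the denominator) 
taken over the 20 true item locations, the 20 true discriminations, and the 100 true thresholds, 
and that thresholds were equally spaced as in Figure 1. Under those assumptions the arithmetic 
gives 4.8%, 11.2%, and 20.3%. 

```python
from itertools import product
from statistics import stdev

LOCATIONS = (-2.0, -1.0, 0.0, 1.0, 2.0)
DISCRIMINATIONS = (0.59, 1.63)
DISTANCES = (0.25, 0.57)

# Rebuild the true parameters of the 20 crossed items (5 thresholds each, assumed).
deltas, alphas, taus = [], [], []
for delta, alpha, dist in product(LOCATIONS, DISCRIMINATIONS, DISTANCES):
    deltas.append(delta)
    alphas.append(alpha)
    taus.extend(-dist * k for k in range(5, 0, -1))

# Average RMSE values reported in the text.
mean_rmse = {"delta": 0.07, "alpha": 0.06, "tau": 0.16}
true_sd = {"delta": stdev(deltas), "alpha": stdev(alphas), "tau": stdev(taus)}

percent_of_sd = {name: 100 * mean_rmse[name] / true_sd[name] for name in mean_rmse}
for name, pct in percent_of_sd.items():
    print(f"{name}: SD = {true_sd[name]:.3f}, RMSE = {pct:.1f}% of SD")
```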

ANOVA Results 

The $\eta^2$ values and the significance level corresponding to each ANOVA effect are shown in 
Table 1 for the 4 dependent measures of interest. Only the main effects of sample size, true 
$\delta_i$, and true $\alpha_i$ had effects that met interpretability criteria for any of the dependent 
measures. Given the high power associated with this design, the $\eta^2$ feature of the 
interpretability criterion was the dominant feature (i.e., all effects with $\eta^2$ > .05 were 
statistically significant at the p < .0125 level). 



Table 1. $\eta^2$ values from ANOVAs on estimation accuracy indices. 

                                                  RMSE 
Source Effect                         $\delta_i$   $\alpha_i$   $\tau_{ik}$   AAD 
N                                     .04          .04          .12           .19 
$\delta$                              .11          .01          .05           .02 
N x $\delta$                          .03          .01          .02           .01 
$\alpha$                              .05          .16          .24           .05 
N x $\alpha$                          .01          .01          .04           .01 
$\delta$ x $\alpha$                   .03          .01          .01           .00 
N x $\delta$ x $\alpha$               .01          .01          .01           .00 
$\tau$                                .00          .01          .00           .00 
N x $\tau$                            .00          .00          .00           .00 
$\delta$ x $\tau$                     .00          .01          .01           .04 
N x $\delta$ x $\tau$                 .00          .01          .00           .01 
$\alpha$ x $\tau$                     .00          .00          .00           .00 
N x $\alpha$ x $\tau$                 .00          .00          .00           .00 
$\delta$ x $\alpha$ x $\tau$          .00          .01          .00           .01 
N x $\delta$ x $\alpha$ x $\tau$      .00          .02          .00           .01 

Statistically significant effects were given in bold type in the original table. The type I error 
rate was equal to .0125. An effect was deemed worthy of interpretation when it was both 
statistically significant and accounted for at least 5% of the variation in a given dependent 
measure. N = sample size. 

Effects of Sample Size 

As expected, all item parameters were estimated more accurately with the larger sample size of 
2000 than with the recommended sample size of 750. However, the increased accuracy reached 
interpretable levels only for the RMSE of $\tau_{ik}$. As shown in the top panel of Figure 2, the 
average RMSE for this parameter was equal to .11 when N=2000 and grew to .21 when N=750. 
These RMSE values represented 14.1% and 26.4% of the standard deviation of true $\tau_{ik}$ 
parameters, respectively. 

The sample size also had an interpretable effect on the mean AAD measures as shown in the 
lower panel of Figure 2. However, the difference between AAD measures obtained in the two 
sample size conditions was only .03. Thus, this difference seemed minor in the context of the 
6-point response scale. 

[Figure 2: two panels with sample size (750, 2000) on the horizontal axis. Top panel: mean 
$\tau_{ik}$ RMSE. Bottom panel: mean AAD.] 

Figure 2. The effect of sample size on average RMSE values for $\tau_{ik}$ (top panel) and mean AAD 
indices (bottom panel). 

Effects of True $\alpha_i$ 

Figure 3 illustrates the effects of true $\alpha_i$ on the accuracy of item parameter estimates and the 
estimated item response function. The accuracy of all parameter estimates was affected by the 
degree of item discrimination. Both $\delta_i$ and $\tau_{ik}$ were estimated more poorly when the true 
$\alpha_i$ was low rather than high (RMSE $\delta_i$: .10 versus .04; RMSE $\tau_{ik}$: .23 versus .09). 
This "cross parameter effect" suggested that $\delta_i$ and $\tau_{ik}$ were more difficult to estimate 
when the true item response function was more diffuse. Interestingly, $\alpha_i$ itself was more 
accurately estimated when it was low rather than high (RMSE $\alpha_i$: .03 versus .08). This finding 
constituted a "same parameter effect." Thus, item discriminations were harder to estimate when 
they were relatively high, yet item locations and subjective category thresholds were more 
precisely estimated when item discriminations were high. 

The mean AAD index also varied systematically as a function of the true value of $\alpha_i$. 
Specifically, the mean AAD index was slightly higher for the low discrimination condition relative 
to the high discrimination condition (e.g., .07 versus .06). Although this effect suggested more 
precise estimation of item response functions in the high discrimination condition, the difference is 
inconsequential when interpreted within the context of the 6-point response scale. 

Effects of True $\delta_i$ 

As shown in Figure 4, the ability to accurately estimate both $\delta_i$ and $\tau_{ik}$ parameters was 
affected by the extremity of $\delta_i$. Specifically, the average RMSE of $\delta_i$ estimates increased 
as the absolute value of the corresponding true $\delta_i$ increased (i.e., a same parameter effect). 
The range of this increase was equal to .10. Similarly, the inaccuracy of $\tau_{ik}$ estimates also 
increased as the extremity of true $\delta_i$ grew larger (i.e., a cross parameter effect). The average 
RMSE for $\tau_{ik}$ estimates increased from .13 to .20 as $\delta_i$ became more extreme. Both the 
same parameter and cross parameter effects associated with true $\delta_i$ were presumably due to 
the smaller number of subjects (i.e., the smaller amount of information) available at the more 
extreme positions on the latent continuum. 
[Figure 3: four panels plotting mean RMSE or mean AAD against true alpha (0.59, 1.63). Panels: 
$\alpha_i$ RMSE, $\delta_i$ RMSE, $\tau_{ik}$ RMSE, and AAD.] 

Figure 3. The effect of true $\alpha_i$ on average RMSE values for $\alpha_i$ (upper left panel), 
$\delta_i$ (upper right panel), and $\tau_{ik}$ (lower left panel) parameter estimates. The 
corresponding effect on mean AAD values is shown in the lower right panel. 

[Figure 4: two panels plotting mean RMSE against true delta. Panels: $\delta_i$ RMSE and 
$\tau_{ik}$ RMSE.] 

Figure 4. The effect of true $\delta_i$ on the mean RMSE values for $\delta_i$ (upper panel) and 
$\tau_{ik}$ (lower panel) parameter estimates. 



Discussion 



The results of this research suggest that the inaccuracy of a given GGUM item parameter 
estimate will vary depending on the true value of the item parameter in question (same parameter 
effect) and on the true values of other item parameters in the model (cross parameter effects). 
However, the degree of both types of parameter estimation inaccuracy will typically be modest. 
These results also imply that both types of parameter estimation inaccuracy will have negligible 
effects on the precision of estimated item response functions. Thus, while interrelationships 
among parameters will reliably affect their estimation and the ultimate representation of the 
response function, these effects will generally be small in magnitude and will have little practical 
consequence as long as 1) the data truly conform to the GGUM, 2) a sufficiently large sample of 
respondents (i.e., $N \geq 750$) is used in the MML estimation procedure, and 3) the distributions of 
true parameters are similar to those used here. Obviously, one must be cognizant of the limited 
generality of a single simulation study, but these results do suggest that MML estimates of 
GGUM item parameters and corresponding item response functions are fairly robust to both same 
parameter and cross parameter effects. 

The ability to accurately estimate GGUM item parameters is a prerequisite for the general 
application of the model in applied measurement situations. These results provide further support 
that all of the GGUM item parameters are estimable given sufficient sample sizes. Moreover, 
given accurate parameter estimates and a well-fitting model, the GGUM should provide the same 
benefits as other item response theory models such as 1) item parameter invariance, 2) person 
parameter invariance, and 3) the ability to estimate the precision of a single individual’s attitude 
estimate. These benefits set the stage for other measurement advantages associated with item 
response theory models such as item banking and computer adaptive testing. Therefore, the 
GGUM should prove to be a useful tool for large scale assessment of attitudes and preferences in 
educational settings. 